A Consensus-Based Fault-Tolerant Event Logger for High Performance Applications

نویسندگان

  • Edson Tavares De Camargo
  • Elias Procópio Duarte
  • Fernando Pedone
چکیده

High-performance computing (HPC) systems traditionally employ rollback-recovery techniques to allow faulttolerant executions of parallel applications. Rollback-recovery based on message logging is an attractive strategy that avoids the drawbacks of coordinated checkpointing in systems with low mean-time between failures (MTBF). Most message logging protocols rely on a centralized event logger to store information (i.e., the determinants) to allow the recovery of an application process. This centralized approach, besides the obvious single point of failure problem, represents a bottleneck for the efficiency of message logging protocols. In this work, we present a fault-tolerant distributed event logger based on consensus that outperforms the centralized approach. We implemented the event logger of MPI determinants using Paxos, a prominent consensus algorithm. Our event logger inherits the Paxos properties: safety is guaranteed even if the system is asynchronous and liveness is guaranteed despite processes failures. Experimental results are reported for the performance of the distributed event logger based both on classic Paxos and parallel Paxos applied to AMG (Algebraic MultiGrid) and NAS Parallel Benchmark applications.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)

Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...

متن کامل

Voting Algorithm Based on Adaptive Neuro Fuzzy Inference System for Fault Tolerant Systems

some applications are critical and must designed Fault Tolerant System. Usually Voting Algorithm is one of the principle elements of a Fault Tolerant System. Two kinds of voting algorithm are used in most applications, they are majority voting algorithm and weighted average algorithm these algorithms have some problems. Majority confronts with the problem of threshold limits and voter of weight...

متن کامل

Voting Algorithm Based on Adaptive Neuro Fuzzy Inference System for Fault Tolerant Systems

some applications are critical and must designed Fault Tolerant System. Usually Voting Algorithm is one of the principle elements of a Fault Tolerant System. Two kinds of voting algorithm are used in most applications, they are majority voting algorithm and weighted average algorithm these algorithms have some problems. Majority confronts with the problem of threshold limits and voter of weight...

متن کامل

Improving Message Logging Protocols Scalability through Distributed Event Logging

Message logging is an attractive solution to provide fault tolerance for message passing applications because it is more scalable than coordinated checkpointing. Sender-based message logging is a well known optimization that allows to save messages payload in the sender memory and so only the events corresponding to message receptions have to be logged reliably using an event logger. In existin...

متن کامل

FEMPI: A Lightweight Fault-tolerant MPI for Embedded Cluster Systems

Ever-increasing demands of space missions for data returns from their limited processing and communications resources have made the traditional approach of data gathering, data compression, and data transmission no longer viable. Increasing on-board processing power by providing high-performance computing (HPC) capabilities using commercial-off-the-shelf (COTS) components is a promising approac...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017